[1] 56884 24
[1] 56884 24
[1] 56884 24
Our immune system, comprising both specific and unspecific cells, plays a crucial role in protecting organisms from pathogens. However, recent discoveries suggest that the scope of pathogen defense extends beyond the traditional immune system, with structural cells contributing significantly. This project aims to elucidate a novel aspect of pathogen defense mechanisms provided by structural cells. The initial section provides an extensive overview of our methodologies and findings, referencing pivotal studies that laid the groundwork for our research. We critically analyze the strengths of our approach, highlighting potential areas for further improvement and exploration. In the second section, we delve into an organ-based analysis utilizing cutting-edge techniques such as RNA sequencing (RNAseq) and Assay for Transposase-Accessible Chromatin sequencing (ATACseq). This detailed examination sheds light on the unique roles and regulatory mechanisms of structural cells in various organs. The third section is dedicated to comparing the effectiveness and insights gained from each analytical tool, specifically focusing on their utility in organ-based research. This comparative analysis enhances our understanding of the intricate interactions within structural cells across different organs. Our project’s goal transcends mere replication of existing analyses. By juxtaposing our findings with those of previous authors, we aim to validate and build upon their results. This approach not only corroborates the existing data but also enriches the interpretation, offering a more profound and nuanced understanding of structural cells’ role in pathogen defense.
Immune cells can be broadly classified into two categories: Innate Immunity and Adaptive Immunity. Each category comprises various cell types with distinct roles in the immune response. However, even though these cell types such as T-cells, B-cells, Dendritic cells and Granulocytes have a fundamental importance in the defence of the organism, recent research has found structural cells to play an imporant role in the response to infection and inflammation (Minton 2020) (Krausgruber et al. 2020) (Tuong and Clatworthy 2020) (Gomes and Teichmann 2020). Specifically, our cells of interest include epithelial cells, endothelial cells and fibroblasts.
| Cell Type | Location | Function and Role in Pathogen Defense |
|---|---|---|
| Epithelial Cells | Present on organ surfaces | Provide more than just support; involved in responses to infection either directly or through interactions with immune cells. |
| Fibroblasts | Part of the connective tissue | Help maintain the extracellular matrix material surrounding cells. |
| Endothelial Cells | Line the interior of vessels (e.g., blood vessels) | Involved in responses to infection either directly or through interactions with immune cells, alongside epithelial cells. |
The main area of research in this case, mainly focuses on describing how structural cells can have an impact on the protection of the organism through the interaction with cells from the immune system. The major takeaways include:
Unrealized Potential in Promoter Accessibility: The study identifies open, accessible promoters with low levels of expression, termed as ‘unrealized potential.’ This suggests these genes are likely poised for a rapid response upon infection, indicating a pre-emptive genetic mechanism in anticipation of pathogenic challenges.
Similarity Across Structural Cells within the Same Organ: A significant observation is that different structural cell types within the same organ exhibit more similarities with each other than the same cell type does across different organs. This finding underscores the influence of the organ-specific microenvironment on cellular behavior and genetic expression.
Distinct Genetic Responses to LCMV and Cytokines: The research highlights that the genetic reaction of structural cells varies significantly between Lymphocytic Choriomeningitis Virus (LCMV) infection and cytokine stimulation. This differential response opens avenues for potential new combination therapies, tailored to specific pathogens and inflammatory stimuli.
Our analysis will mainly focus on the first two points, using RNA-seq and ATAC-seq pipelines as our main tool for investigation. We will use RNA-seq to quantify gene expression levels, and use ATAC-seq to assess the chromatin accessibility.
We are working with bulk RNA-seq data and bulk ATAC-seq data from the paper “Structural cells are key regulators of organ-specific immune responses” by Krausgruber et al. The paper focuses on identifying unrealized epigenetic potentials in structural cells across different organs in mice (mus musculus). We are working with a subset of the data, namely the structural cells of 8 organs; the brain, heart, lung, liver, spleen, kidney, large intestine and small intestine. There are 3 types of structural cells investigated in the paper which we will work with, namely endothelial, epithelial and fibroblast cells.
The RNA-seq data is obtained from 3 different replicates of C57BL/6J mice, whilst the ATAC-seq data was obtained from 2 different replicates. Both the RNA-seq data and the ATAC-seq data were obtained using Illumina single-end sequencing.
We obtain our RNA-seq and ATAC-seq data from SRA by running the accession list through the fasterq-dump tool.
cat SRR_Acc_List.txt | xargs fasterq-dump --threads 8 --progress
We use FastQC in order to obtain our quality control results both for our RNA-seq data and our ATAC-seq data.
For our RNA-seq data we obtain failing grades for per base sequence content and per sequence GC content, which are commonly ignored. Just like in ATAC-seq, we receive warnings for sequence duplication warnings, however, for RNA-seq data this is section can be ignored. FastQC will yield these errors for our data as it was originally intended to use whole genome data #fact check
As for the overrepresented sequences, a trimming is often applied if there are also many samples that fail in adapter content. Unlike our ATAC-seq quality control, this is not the case here. For the few samples for where there are failing grades, we will rely on using soft trimming during the mapping step using STAR. Thus, we continue with the data as-is into our mapping.
We obtain warnings and failing grades for adapter content for our endothelial and epithelial data, along with failing grades for sequence duplication levels. In the quality control reports for adapter content there is a presence the Nextera transposase sequences, which are commonly used in ATAC-seq data obtained from Illumina. In order to remove the adapter content, we use fastp to trim the sequences for the adapter. As we want to trim the Nextera transposase sequences, we specify that we wish to trim CTGTCTCTTATACACATCT. In order to address the sequence duplication levels present in the different, we wish to perform a deduplication step. We use fastp for both our trimming and our deduplication.
fastp --dedup --adapter_sequence CTGTCTCTTATACACATCT -l 25 -i $file -o trimmed_dedup/$file
After performing the deduplication and trimming steps, we rerun FastQC and obtain our new quality control results.
We see that the deduplication and trimming steps fix our failing grades for adapter content, successfully removing the Nextera adapters. The sequence deduplication levels now also receive passing grades, indicating a successfull deduplication. We are still left with warnings for the sequence length distribution and per base sequence content, however these are commonly obtained warnings and are often ignored if the other sections are fine.
The mapping is done with STAR for RNA-seq and with Bowtie2 for ATAC-seq. We will use the GRCm39 mouse genome from the UCSC Genome Browser as our reference genome, for which we will build an index.
For mapping ATAC-seq data with Bowtie2, we use the following settings in our bash script:
bowtie2 --very-sensitive \
-p 16 \
-X 2000 \
-x genome/indexed_m39_genome \
-U rawdata/${id}*.fastq.gz \
-S genome/${id}.sam
We then use Samtools in order to convert the .sam files to .bam files.
for f in *.sam
do
samtools view -Sb $f > $f.bam
done
For our RNA-seq data, we use the following settings for STAR:
STAR --genomeDir genome/star-index \
--runThreadN 10 \
--readFilesIn ${id}.fastq.gz \
--readFilesCommand zcat \
--sjdbGTFfile genome/gencode.vM33.annotation.gtf \
--quantMode TranscriptomeSAM \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix mapping_transcriptome/$id/
After having completed mapping we will look at the quality of the mapping for our ATAC-seq and RNA-seq data.
For our endothelial RNA data, we see a large variety in the amount of uniquely mapped reads, the amount of reads to multiple loci and unmapped reads. We have some outliers with very poor unique mapping percentages, with SRR11256465 having a below 10% mapping rate, and with SRR11256377 and SRR11256377 having a unique mapping percentage of below 20%.
Note that not all mapping percentages add up to 100%, as STAR also reports statistics such as “% of reads mapped to too many loci” and “% unmapped reads: other” which usually make up <2% of the mapping percentage which we have chosen not to include in our figure.
For the endothelial RNA data, the result is similar to our endothelium data in that the results are quite varied. We can again identify outliers; SRR11256459 and SRR11256560 are have unique mapping percentages around the 10% mark, with SRR11256442 and SRR11256458 having unique mapping percentages around 25%.
For the samples with unique mapping percentages around 10%; SRR11256459 and SRR11256460, the number of uniquely mapped reads is 1198017 and 1998939 respectively.
Again, for the fibroblast RNA data we observe that the mapping rates for uniquely mapped, unmapped and reads mapped to multiple loci is varying across samples. For fibroblast data, SRR11256383, SRR11256384 and SRR11256401 have ~10% uniquely mapping reads. We will continue with the samples that have a lower mapping rate in the differential expression analysis, however, we will keep an eye on these samples in case outliers in the differential expression analysis correlate to poorly mapped samples.
Before moving on to our RNA-seq analysis, we use RSEM in order to obtain normalized gene counts, which we will use as our input for the RNA-seq analysis.
rsem-calculate-expression --alignments \
-p 10 \
mapping_transcriptome/$id/Aligned.toTranscriptome.out.bam \
genome/rsem_grcm39_genecode33/rsem_grcm39_genecode33 \
rsem/$id/$id
We load in the RSEM data and retrieve additional metadata such as the ensembl_gene_id and mgi_symbol from Ensembl.
We have 56884 annotated genes and 24 samples for each of our three types of structural cells.
To focus on the specific genes we’re interested in, we can examine the average expression level of these genes in each sample in our dataset.
Distributions of the average counts per sample:
We look at in how many samples a gene was detected.
In order to get a better understanding of the distribution of our detected genes, we create a violin plot showing the number of genes detected across samples over the count for the different types of structural cells.
From this plot we can see we have a high number of genes which are barely expressed or completely unexpressed in the three cell types. Hence, we select the expressed genes and mark them in the meta dataframe to trace them for our following analysis. Moreover, from the violin plot we can see the distribution of the expression of the three types of cells is distributed as a negative-binomial distribution.
We filter away the unexpressed genes based on a criteria rowMeans(expr_endo > 0) >= 0.5 | rowMeans(expr_endo) >= 1, the count of genes pertinent to our analysis has now reduced to 14478 for endothelial cells, 15461 for epithelial cells, and 15783 for fibroblasts. We have added a column to mark these relevant genes. Moving forward, our analysis will focus exclusively on this subset of genes.
expressed_endo <- rowMeans(expr_endo > 0) >= 0.5 | rowMeans(expr_endo) >= 1
expressed_epi <- rowMeans(expr_epi > 0) >= 0.5 | rowMeans(expr_epi) >= 1
expressed_fibro <- rowMeans(expr_fibro > 0) >= 0.5 | rowMeans(expr_fibro) >= 1
meta_genes$expressed_endo <- expressed_endo
meta_genes$expressed_epi <- expressed_epi
meta_genes$expressed_fibro <- expressed_fibro[1] 14478
[1] 15461
[1] 15783
To assess the correlation between samples whilst disregarding information about sample covariates, we will use Spearman’s rank correlation coefficient (SCC). This choice is due to our data not conforming to normal or log-normal distributions. Unlike Pearson’s correlation coefficient (PCC), SCC is non-parametric, making it more adaptable and robust to the nature of our data distribution.
Therefore, we will generate two hierarchical clustering trees to analyze our samples. This approach will help us better understand the clustering patterns of our samples, specifically focusing on ‘Organ’ (our variable of interest) and ‘Replicate’ (our covariate).
Furthermore, we will also use the hierarchical clustering trees to observe the behaviour of our outlier samples, as we have marked the outliers from the mapping phase in red.
The samples display clear clustering by organ type, indicating that the organ from which they were derived has a significant impact on the results in all of the three cell types. In contrast, the distribution of replicate numbers across the clusters appears to be random, suggesting that it does not markedly affect the organ-specific clustering.
Regarding the outliers, there is some evidence of clustering among them in the three cell types. However, this may not necessarily indicate an anomaly, as all outliers are sourced from similar organs (Spleen and Liver), which naturally tend to group together in the analysis, regardless of their outlier status. However, in the epithelial clustering, there is one clear outlier for Spleen which does not cluster well with any of the other samples. Thus, as it has a poor unique mapping rate and does not seem to agree with the rest of our data, we choose to remove it from further downstream analysis.
[1] 56884 23
[1] 23 8
We did attempt to first perform batch correction using ComBat, but found that it had no effect on the quality of our clustering. Instead of using ComBat, we will account for the batch effect introduced by the replicates by incorporating it as a covariate in our differential expression analysis.
For the differential expression, we want to compare three methods: anova test, DESeq2, and edgeR. The following table compares the three methods for differential expression analysis:
| Method | Description | Key Features |
|---|---|---|
| ANOVA Test | ANOVA (Analysis of Variance) is a statistical method used to compare means of two or more groups. In the context of gene expression, it can be used to determine if the expression of a gene is significantly different across different conditions or groups. | - Suitable for comparing multiple groups. - Assumes normal distribution and homogeneity of variances. - Less suited for count data typical in RNA-seq. |
| DESeq2 | DESeq2 is a statistical method tailored for RNA-seq data analysis. It’s designed to normalize count data and model variance for differential expression analysis. | - Uses negative binomial distribution to model RNA-seq data. - Adjusts for library size and other normalization factors. - Provides methods for estimating variance and significance of changes. |
| edgeR | edgeR is another popular tool for analyzing RNA-seq count data. Similar to DESeq2, it models count data using a negative binomial distribution but has some differences in its approach to normalization and variance estimation. | - Effective for small sample sizes. - Uses an overdispersed Poisson model to account for biological variability. - Employs empirical Bayes methods for estimating dispersions. |
Each method has its strengths and is best suited for different types of data sets and research objectives. The choice of method should be guided by the specific characteristics of the data and the research goals.
Anova test assumes a normal distribution across our data, which cannot be the case with expression data, but sometimes the approximation still works. Therefore, in the first place, we perform an anova test to then compare the results with two more advanced methods, namely edgeR and DESeq2.
# function to perform the differential expression test with ANOVA
ANOVA_test <- function(expr,
cond,
ctrl = NULL,
covar = NULL,
padj_method = p.adjust.methods){
pval_fc <- data.frame(t(pbapply(expr, 1, function(e){
dat <- data.frame(y = log1p(e),
cond = cond)
if (! is.null(covar))
dat <- data.frame(dat, covar)
m1 <- lm(y ~ ., data = dat)
m0 <- lm(y ~ . - cond, data = dat)
test <- anova(m1, m0)
pval <- test$Pr[2]
avgs <- tapply(log1p(e), cond, mean)
if (! is.null(ctrl) && sum(cond %in% ctrl) > 0){
fc <- exp(max(avgs[names(avgs) != ctrl]) - avgs[ctrl])
} else{
fc <- exp(max(avgs) - min(avgs))
}
return(c(pval = unname(pval), fc = unname(fc)))
})), row.names = rownames(expr))
padj <- p.adjust(pval_fc$pval, method = padj_method)
return(data.frame(pval_fc, "padj" = padj)[,c("pval","padj","fc")])
}Since ANOVA assumes a normal distribution across data, we make use of the Negative Binomial (NB) distribution using edgeR or DESeq2. We can perform the edgeR and DESeq2 analyses to test if genes across organs are truly DE genes or the differential expression is just caused due to the different replicates, hence we model them with organs and replicate numbers combined.
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
transcripts missing from tx2gene: 4332
summarizing abundance
summarizing counts
summarizing length
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
transcripts missing from tx2gene: 4332
summarizing abundance
summarizing counts
summarizing length
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
transcripts missing from tx2gene: 4332
summarizing abundance
summarizing counts
summarizing length
For DESeq2, we use raw counts matrix as input (that is not normalized), and specify our design matrix as ~ Organ + Replicate, and then later reduce by Replicate when taking the likelihood ratio test.
# pipeline for endothelial cells
dds_endo <- DESeqDataSetFromTximport(txi_endo,
colData = meta_endo,
design = ~ Organ+Replicate)
dds_filtered_endo <- dds_endo[intersect(rownames(expr_endo)[meta_genes$expressed_endo], rownames(dds_endo)),]
dds_filtered_endo <- DESeq(dds_filtered_endo, test="LRT", reduced= ~ Replicate)
res_DESeq2_raw_endo <- results(dds_filtered_endo)Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors
For edgeR, we also use the raw counts.
# pipeline for endothelial cells
grp_endo <- meta_endo$Organ
group_endo <- factor(grp_endo)
rep_endo <- meta_endo$Replicate
replicate_endo <- factor(rep_endo)
design_endo <- model.matrix(~ replicate_endo + group_endo)
cts_endo <- txi_endo$counts
normMat_endo <- txi_endo$length
normMat_endo <- normMat_endo/exp(rowMeans(log(normMat_endo)))
normCts_endo <- cts_endo/normMat_endo
y_endo <- DGEList(counts=cts_endo, group=group_endo)
y_endo <- calcNormFactors(y_endo)
y_endo <- scaleOffset(y_endo, normMat_endo)
y_endo <- estimateDisp(y_endo,design_endo)
fit_endo <- glmQLFit(y_endo,design_endo)
qlf_endo <- glmQLFTest(fit_endo, coef=4:10)
res_edgeR_endo <- topTags(qlf_endo, n=Inf, p.value=0.1)$tableWe use quasi-likelihood F-test (QL) instead of the likelihood ratio test because it is better for DE analysis of RNA-bulk data for edgeR.
We can now compare the p-values obtained from the anova test, DESeq2, and edgeR.
Comparison in endothelial cells:
Warning: Removed 3137 rows containing missing values (`geom_point()`).
Warning: Removed 2263 rows containing missing values (`geom_point()`).
Warning: Removed 3137 rows containing missing values (`geom_point()`).
We see that the p-values are quite different between the different methods. As we want to be stringent with which genes we investigate, we proceed with only the differential expressed genes which were found using both DESeq2 and edgeR. We note that this is a conservative estimate.
ENDOTHELIAL CELLS RESULTS:
max_layer_DEG_endo
Brain Heart Kidney LargeIntestine Liver
387 16 171 55 85
Lung SmallIntestine Spleen
163 88 56
EPITHELIAL CELLS RESULTS:
max_layer_DEG_epi
Brain Heart Kidney LargeIntestine Liver
286 30 42 117 63
Lung SmallIntestine Spleen
38 67 24
FIBROBLASTS RESULTS:
max_layer_DEG_fibro
Brain Heart Kidney LargeIntestine Liver
173 44 68 30 49
Lung SmallIntestine Spleen
40 35 54
ATAC-seq, short for Assay for Transposase-Accessible Chromatin using sequencing, is a powerful technique widely utilized in molecular biology. This method offers valuable insights into the chromatin accessibility of a genome. Chromatin accessibility is defined as the ease with which DNA-binding proteins, including transcription factors, can access and interact with specific regions of the DNA. When working with ATAC-seq, there can be the possibility to see the state of the chromatin, something not really possible when considering RNA-seq data.
To introduce this analysis, we briefly go through the main steps: 1. Alignment is the first activity when starting studying some ATAC-seq data. In this step, our files are transformed from SAM to BAM. In our analysis we obtain BAM files of all the three cell types after the alignment. 2. Sorting BAM files and obtaining the indexing of these files. Sorting a BAM file involves arranging the reads in a file based on their chromosomal coordinates. This is essential for efficient retrieval of specific reads or regions of interest. Without the sorting step, searching for specific reads would require scanning through the entire file, which can be time-consuming and computationally expensive. Indexing a sorted BAM file creates a small auxiliary index file that stores information about the position of reads in the sorted file. This index file enables efficient random access to specific reads or regions of interest without having to resort to scanning the entire file. After those activities, there is the possibility of continuing the computational pipeline obtaining BigWig files and calling mac2 callpeaks. However, since the indexing requires a lot of time to run, therefore we analysed intermediate steps where we load intermediate results and continue the pipeline, avoiding to run all the steps de novo.
outBAM_dir <- "ATACseq/atac_fibro_bam"
bam_files <- list.files(outBAM_dir, pattern = "\\.bam$", full.names = TRUE)
# Process each BAM file
for(file in bam_files) {
# Generate a new filename for the sorted BAM file
sortedBAM <- file.path(dirname(file), paste0("Sorted_", basename(file)))
# Sort the BAM file
sortBam(file, gsub("\\.bam$", "", basename(sortedBAM)))
# Index the sorted BAM file
indexBam(sortedBAM)
}
# This code is used for the Bigwig files
sortedBAM_files <- gsub("SRR", "Sorted_SRR", bam_files)
# In case of multiple files
openRegionBigWig <- gsub("\\.sam\\.bam", "_openRegions.bw", sortedBAM_files)
# loop over each file to process open regions and export to bigWig
for(i in seq_along(sortedBAM_files)) {
# convert the 'open' reads to a GRanges object
atacFragments_Open <- GRanges(atacReads_Open_list[[i]])
# calculate coverage and export to BigWig format
export.bw(coverage(atacFragments_Open), openRegionBigWig[i])
}
# this code is used for peak calling
output_dir <- "/Users/guidoputignano/Research/stat426/Project/Main/ATACseq/fibro_peaks"
genome_size <- "mm"
# Assuming sortedBAM_files is your vector of sorted BAM file paths
for(i in seq_along(sortedBAM_files)) {
# Extract the file name without the extension for naming in MACS2
file_name <- tools::file_path_sans_ext(basename(sortedBAM_files[i]))
# Construct the MACS2 command with both output and error redirected to the log file
log_file <- paste(output_dir, "/", file_name, ".peak.log", sep="")
macs2_command <- paste("macs2 callpeak -t", sortedBAM_files[i],
"--nomodel --shift -100 --extsize 200 --format BAM -g", genome_size,
"-n", file_name, "--outdir", output_dir,
"&>", log_file, "2>&1")
# Print the command for debugging
cat("Executing: ", macs2_command, "\n")
# Run the MACS2 command using system()
system(macs2_command)
}Depending on the cell type, there are some differences in the mapped regions. It is possible to use Rsubread’s propmapped to give us number of mapped and unmapped reads, as well as the mapping percentage.
NumTotal NumMapped PropMapped
Sorted_SRR11256132.sam.bam 29803943 25068048 0.841098
Sorted_SRR11256141.sam.bam 31968002 28083168 0.878477
Sorted_SRR11256142.sam.bam 35762434 30775413 0.860551
Sorted_SRR11256147.sam.bam 35263264 24978789 0.708352
Sorted_SRR11256148.sam.bam 36312991 24967149 0.687554
Sorted_SRR11256149.sam.bam 30197589 29340889 0.971630
Sorted_SRR11256155.sam.bam 25218027 22081869 0.875638
Sorted_SRR11256156.sam.bam 29112339 24550883 0.843315
Sorted_SRR11256161.sam.bam 42215289 22855868 0.541412
Sorted_SRR11256162.sam.bam 37862042 22554752 0.595709
Sorted_SRR11256163.sam.bam 23988934 14919581 0.621936
Sorted_SRR11256164.sam.bam 36096888 23835954 0.660333
Sorted_SRR11256169.sam.bam 22022067 19701760 0.894637
Sorted_SRR11256170.sam.bam 26985544 23218356 0.860400
Sorted_SRR11256185.sam.bam 34175861 29754122 0.870618
Sorted_SRR11256186.sam.bam 28223374 24335072 0.862231
Sorted_SRR11256199.sam.bam 28611200 26533234 0.927372
NumTotal NumMapped PropMapped
Sorted_SRR11256134.sam.bam 32272960 28915274 0.895960
Sorted_SRR11256135.sam.bam 32729942 31236532 0.954372
Sorted_SRR11256143.sam.bam 31821569 24660918 0.774975
Sorted_SRR11256144.sam.bam 20012557 18561408 0.927488
Sorted_SRR11256150.sam.bam 20936639 16020708 0.765200
Sorted_SRR11256151.sam.bam 23001938 17272338 0.750908
Sorted_SRR11256152.sam.bam 44497728 37177497 0.835492
Sorted_SRR11256157.sam.bam 25297847 18170335 0.718256
Sorted_SRR11256158.sam.bam 47336880 34607260 0.731085
Sorted_SRR11256165.sam.bam 32511208 21578171 0.663715
Sorted_SRR11256166.sam.bam 33400278 19292257 0.577608
Sorted_SRR11256171.sam.bam 28905397 24655072 0.852957
Sorted_SRR11256172.sam.bam 25085109 11113922 0.443049
Sorted_SRR11256187.sam.bam 23118274 19323418 0.835850
Sorted_SRR11256188.sam.bam 28824516 24525778 0.850865
Sorted_SRR11256189.sam.bam 36439451 32117618 0.881397
Sorted_SRR11256196.sam.bam 27304395 26539000 0.971968
Sorted_SRR11256206.sam.bam 28181973 20675618 0.733647
NumTotal NumMapped PropMapped
Sorted_SRR11256136.sam.bam 37487953 34950305 0.932308
Sorted_SRR11256137.sam.bam 46047718 42335509 0.919383
Sorted_SRR11256145.sam.bam 30270677 28506556 0.941722
Sorted_SRR11256146.sam.bam 45193814 41597951 0.920435
Sorted_SRR11256153.sam.bam 25228337 23369741 0.926329
Sorted_SRR11256159.sam.bam 24965279 22860470 0.915691
Sorted_SRR11256167.sam.bam 17401849 16013731 0.920232
Sorted_SRR11256168.sam.bam 25711420 24867010 0.967158
Sorted_SRR11256173.sam.bam 23342779 20685629 0.886168
Sorted_SRR11256174.sam.bam 27748692 24721918 0.890922
Sorted_SRR11256190.sam.bam 28247388 25961781 0.919086
Sorted_SRR11256191.sam.bam 18355217 17241130 0.939304
Sorted_SRR11256195.sam.bam 23565670 22611273 0.959501
These three tables describe the mapping of epithelial cells, endothelial cells, and fibroblasts respectively.
| Metric | Description |
|---|---|
| Mapped Reads | Mapped Reads are the sequence reads that have been successfully aligned to a reference genome. A read is considered “mapped” if the alignment software finds a location in the reference genome where the read matches or closely matches, taking into account possible sequencing errors or genetic variations. Mapped reads are critical for most downstream analyses, such as variant calling, gene expression analysis, etc. |
| Unmapped Reads | Unmapped Reads are the reads that could not be aligned to the reference genome. There could be several reasons for a read being unmapped, such as the read being too short, having poor quality, containing too many errors or ambiguities, or originating from a region not present in the reference genome (like a novel insert or a contaminant). |
| Mapping Percentage | Mapping Percentage is the percentage of total reads that have been successfully mapped to the reference genome. It is calculated as the number of mapped reads divided by the total number of reads (mapped plus unmapped), multiplied by 100. This metric is crucial for assessing the quality of the sequencing data and the effectiveness of the alignment process. |
In case there may be the interest of knowing more about the sorted files we can use some functions like:
Here we provide the description only in the case of the sample SRR11256152, since loop through all samples would be very time-consuming.
| Category | Count | Description |
|---|---|---|
| Total Reads | 44497728 + 0 | QC-passed reads + QC-failed reads |
| Primary Reads | 44497728 + 0 | Primary reads |
| Secondary Reads | 0 + 0 | Secondary reads |
| Supplementary Reads | 0 + 0 | Supplementary reads |
| Duplicates | 0 + 0 | Duplicate reads |
| Primary Duplicates | 0 + 0 | Primary duplicate reads |
| Mapped | 37177497 + 0 (83.55%) | Mapped reads (Percentage Mapped) |
| Primary Mapped | 37177497 + 0 (83.55%) | Primary mapped reads (Percentage Mapped) |
| Paired in Sequencing | 0 + 0 | Paired reads in sequencing |
| Read1 | 0 + 0 | First reads in pairs |
| Read2 | 0 + 0 | Second reads in pairs |
| Properly Paired | 0 + 0 | Properly paired reads |
| With Itself and Mate Mapped | 0 + 0 | Reads with itself and mate mapped |
| Singletons | 0 + 0 | Singleton reads |
| With Mate Mapped to Different Chr | 0 + 0 | Reads with mate mapped to a different chromosome |
| With Mate Mapped to Different Chr (mapQ>=5) | 0 + 0 | Reads with mate mapped to a different chromosome with mapQ>=5 |
In the context of genomic data, identifiers like “GL456210.1,” “JH584295.1,” and “MU069434.1” represent specific sequences that are typically not part of the primary assembly of a genome. If you may want to compare the genome with a reference, you have to remove those elements, which are not of interest in our analysis, considering only the main chromosomal regions present in our read. Making use of the chromosomal regions, it is possible to see the sample was a male, as also seen from the Y chromosome. Moreover, we can note there is an evident decrease in the mapped value from the Chromosome 1 until reaching the last chromosomes. That is something normal when considering mice chromosomes.
Distribution of the mapped reads depending on the sequence
Across lots of function for the alignment, we decided readGAlignments is the best function in our case. In general, there is also the possibility of using the function readGAlignmentPairs(), which is used when there are two reads (not our case). Considering the sample SRR11256152:
| Metric | Description |
|---|---|
| Total Reads | There are 44,497,728 total reads, all of which are primary (none are secondary or supplementary). |
| Mapped Reads | Out of the total reads, 37,177,497 are mapped, which corresponds to an 83.55% mapping rate. This indicates that the majority of your reads align to the reference genome. |
| Paired-End Reads | The output indicates that there are 0 paired in sequencing, meaning there are no paired-end reads in your dataset. |
| Properly Paired Reads | The count of 0 properly paired further confirms the absence of paired-end reads in the dataset. |
When considering this analysis, the paper suggested several genes expressed depending on the cell type.
Gene Tissue CellType
1 Nppc Heart Endothelium
2 Dner Heart Epithelium
3 Ptpn7 Kidney Endothelium
4 Mr1 Liver Epithelium
5 Cxcr4 Liver Epithelium
6 Hlx Liver Epithelium
7 Fcer1g Liver Fibroblasts
8 Actr3 Liver Fibroblasts
9 F11r Liver Fibroblasts
10 Ncstn Liver Fibroblasts
11 Atf6 Spleen Endothelium
12 Arpc2 Spleen Epithelium
13 Psmd1 Spleen Epithelium
14 Atf6 Spleen Epithelium
15 Sumo1 Spleen Epithelium
16 Casp8 Spleen Epithelium
17 Nppc Spleen Fibroblasts
18 Bend6 Spleen Fibroblasts
After obtaining the GRanges file, we can focus our understanding on the Chromosome 1 that has the majority of mapped regions. In this case, it may be interesting to study the length of the reads through the function width(atacReads), where each number represents the length of a read, and all the first six reads have a length of 51 nucleotides. This means each of these reads is 51 bases long.
Distribution of frequency of Read Mapping Quality (MapQ) Scores taking the example of SRR11256152.
[1] 51 51 51 51 51 51
[1] 0 0 0 0 0 0
This result is expected, as our reads are single-end reads. If our reads were double-end reads, the result would have been different. Moreover, in this second case we should have used the function readLengths, instead of insertSizes, to see the length of our reads.
Min. 1st Qu. Median Mean 3rd Qu. Max.
44.00 51.00 51.00 50.99 51.00 57.00
In the context of sequencing data, especially for single-end reads, the read length is typically consistent across all reads if the data comes from a sequencing run with a fixed read length setting. The provided output suggests that the sequencing data might have a uniform read length of 51 bases per read, which is common in many sequencing platforms where you specify a certain read length for the entire run. In any case, there are also some variation when considering the whole data.
With ATAC-seq, we are interested in knowing both ends of the DNA fragments generated by the assay, since the ends indicate where the transposase inserted. This can be done only with paired-end reads. A single-end read defines only one end of the fragment, and, although computational methods to infer fragment lengths from single-end reads exist, they tend to be inaccurate. With single-end reads, there is only one position to compare, and so any reads whose 5’ ends match are considered duplicates. Thus, many false positives may result, and perfectly good reads are removed from further analysis
The gals is the form of a large list in this case.
The PT score, or Promoter-Transcript score, is a metric used in genomic studies, particularly in the analysis of chromatin immunoprecipitation sequencing (ChIP-seq) data. The concept behind the PT score is to compare the level of a certain genomic feature (like protein binding, histone modifications, or other epigenetic marks) at promoter regions versus the rest of the transcript (transcript body).
The Nucleosome Free Regions (NFR) score is a metric used in genomics to identify and quantify regions in the genome that are devoid of nucleosomes. Nucleosomes are the basic units of DNA packaging in eukaryotic cells and consist of a segment of DNA wound around a core of histone proteins. Regions that are free of nucleosomes are often indicative of regulatory elements, such as promoters, enhancers, or other DNA-protein interaction sites, and are thus of great interest in the study of gene regulation and chromatin structure.
TSS enrichment score is a ratio between aggregate distribution of reads centered on TSSs and that flanking the corresponding TSSs. TSS score = the depth of TSS (each 100bp window within 1000 bp each side) / the depth of end flanks (100bp each end). TSSE score = max(mean(TSS score in each window)). TSS enrichment score is calculated according to the definition at https://www.encodeproject.org/data-standards/terms/#enrichment. Transcription start site (TSS) enrichment values are dependent on the reference files used.
In this case, we can consider the low quality control, duplicates and signal distributions in several situations. We can take one example into account:
Sample Reads Filt. RiP. RiBL.
1: Sorted_SRR11256132.sam.bam 1724877 12.7 11.3 1.620
2: Sorted_SRR11256141.sam.bam 2025523 23.0 10.1 0.582
3: Sorted_SRR11256142.sam.bam 2214846 24.2 13.4 0.613
| Metric | Description |
|---|---|
| Reads | The total number of reads sequenced or processed. |
| Filt. | Short for “Filtered.” Represents the number or percentage of reads filtered out due to poor quality or other criteria. |
| RiP. | Possibly “Reads in Peaks.” Denotes the percentage of reads that fall into the identified peaks, indicative of regions of interest. |
| RiBL. | Likely “Reads in BlackList Regions.” Indicates the percentage of reads mapping to blacklist regions, known to produce unreliable or artifact-prone signals. |
In this case, there is just an example of some of elements of the data.
| Metric | Description |
|---|---|
| Mapped | The total number of reads that were successfully aligned to the reference genome. |
| Dup_Percent | The percentage of reads that are duplicates of other reads, which can be indicative of amplification bias or other issues in sequencing. |
| Genomic Feature | Description |
|---|---|
| Promoter (<=1kb) | Regions within 1 kilobase (kb) upstream of the transcription start site (TSS) of a gene. Promoters are crucial for gene regulation, serving as the binding site for RNA polymerase and other transcription factors. |
| Promoter (1-2kb) | Regions located 1 to 2 kb upstream of the TSS. |
| Promoter (2-3kb) | Regions located 2 to 3 kb upstream of the TSS. Promoter regions are typically involved in the initiation of transcription and can vary in distance from the actual TSS. Different ranges are specified to understand the distribution of peaks concerning the proximity to the TSS. |
| 5’ UTR | The untranslated region (UTR) immediately upstream of the coding sequence of a gene on the 5’ end. It’s involved in the regulation of translation and can affect the stability, localization, and efficiency of mRNA translation. |
| 3’ UTR | The untranslated region immediately downstream of the coding sequence of a gene on the 3’ end. Similar to the 5’ UTR, it plays roles in post-transcriptional regulation, affecting mRNA stability and regulation of translation. |
| 1st Exon | The first exon of a gene, which includes both the translated and untranslated regions. It is significant as it often contains both regulatory elements and the beginning of the coding sequence. |
| Other Exon | Refers to all exons in a gene except for the first. Exons are the parts of the gene sequence that are expressed into mRNA and then translated into protein. |
| 1st Intron | The first intron of a gene, which is the non-coding sequence immediately following the first exon. Introns are spliced out during mRNA processing but can contain important regulatory elements and splice sites. |
| Other Intron | Refers to all introns in a gene except for the first. Introns can vary greatly in size and can be involved in gene regulation. |
| Downstream (<=3kb) | Regions located within 3 kb downstream of the transcription end site (TES) of a gene. These areas can contain regulatory elements that influence gene transcription termination, mRNA stability, or other post-transcriptional controls. |
| Distal Intergenic | Regions that are far from known genes, typically not falling within any of the specified promoter, intronic, or exonic regions. These areas might be involved in long-range regulation or may contain as-yet-uncharacterized genes or regulatory elements. |
When analysing the three cell types, we performed several analysis, among them, we wanted to understand the annotation of the genomic data for chromosome 1 (chr1) generated by the MACS (Model-based Analysis of ChIP-Seq) tool. In this case, we noticed that the cells had really similar concentration of Promoter, UTR, Introns and all the other components. For this reason, we are displaying a plot to show the distribution in the three cases. This case in particular describes the epithelial cells, but it can be generalised to all the other cell types.
GRanges object with 810 ranges and 9 metadata columns:
seqnames ranges strand | annotation geneChr
<Rle> <IRanges> <Rle> | <character> <integer>
[1] chr1 4877903-4878119 * | Promoter (<=1kb) 1
[2] chr1 4927718-4928073 * | Promoter (<=1kb) 1
[3] chr1 6284937-6285205 * | Promoter (<=1kb) 1
[4] chr1 7158790-7159439 * | Promoter (<=1kb) 1
[5] chr1 9615343-9615687 * | Promoter (<=1kb) 1
... ... ... ... . ... ...
[806] chr1 194621056-194621287 * | Promoter (<=1kb) 1
[807] chr1 194658452-194659371 * | Promoter (<=1kb) 1
[808] chr1 194774808-194775008 * | Promoter (<=1kb) 1
[809] chr1 194813227-194813567 * | Promoter (<=1kb) 1
[810] chr1 194813771-194814075 * | Promoter (<=1kb) 1
geneStart geneEnd geneLength geneStrand geneId transcriptId
<integer> <integer> <integer> <integer> <character> <character>
[1] 4878046 4916958 38913 1 18777 NM_001355712
[2] 4927917 4968132 40216 1 21399 NM_001159750
[3] 6284886 6346328 61443 1 12421 NM_009826
[4] 7159144 7243852 84709 1 319263 NM_183028
[5] 9615633 9617680 2048 1 59014 NM_021511
... ... ... ... ... ... ...
[806] 194621129 194643600 22472 1 12490 NM_001111059
[807] 194638067 194659267 21201 2 320400 NR_033564
[808] 194723546 194774556 51011 2 17221 NR_132621
[809] 194781019 194813878 32860 2 12946 NM_001355060
[810] 194781019 194813878 32860 2 12946 NM_001355060
distanceToTSS
<numeric>
[1] 0
[2] 0
[3] 51
[4] 0
[5] 0
... ...
[806] 0
[807] 0
[808] -252
[809] 311
[810] 0
-------
seqinfo: 1 sequence from mm39 genome
After performing the analysis, there was the possibility of reviewing some genetic expression and how they would related to the use of Integrative Genome Viewer (IGV).The Integrative Genomics Viewer (IGV) is a high-performance genome browser used for visualizing and exploring large-scale genomic datasets. It provides researchers and scientists with a powerful tool to interactively view and analyze diverse genomic data. Obtaining our BigWig and bed files, there was the possibility of analysing them using IGV or the UCSC Genome Browser. The UCSC Genome Browser is a web-based tool that allows researchers to visualize and explore genomic data. It provides a user-friendly interface for viewing annotated genome sequences and various genomic annotations, making it a valuable resource for researchers in genomics, genetics, and related fields. In this case, it’s possible to note that there are .bed files that have been useful to identify the location of the peaks, while BigWig files is particularly useful to identify the signal intensities from high-throughput sequencing experiments. For example, when considering the relationship between BigWig and bed files, it’s possible to notice that .bed files would show the presence of peaks, but for the quantificiation BigWig files can be helpful:
In any case, .bed files can be helpful when alaysing how many peaks can be present in a particular cell type. For example, when analysing endothelial cells, Mixi file wasn’t highly expressed, and the quantity of peaks in bed files was really close to 0. At the same time, when considering F11r, Actr3, Fcer1g as examples, they were immune genes expressed in the case of fibroblasts, but some of them also had an expression in other cells, such as in endothelial cells.
The conclusion can be divided looking either at The RNA-seq analysis, or at the ATAC-seq analysis. Our analysis mostly focused on the evaluation of unrealized Potential in Promoter Accessibility and the Similarity Across Structural Cells within the Same Organ.
In the case of RNA-seq, after performing our analysis we analysed the DE in each organ noticing that genes expression shows huge variability across different organs, which indicates there are highly specific patterns of expression, defining that cells tend to be similar organs compared to cell types. In this case, we used different tools to avoid a bias related to the analysis used. For this reason, we noticed that the statement on the similarity across structural cells within the same organ was correct.
In the case of ATAC-seq, we wanted to verify that the conclusion present in the case of RNA-seq was also correct. After the analysis of some genes such as F11r, Actr3, Fcer1g, there was an understanding that the cell type may not be highly influential compared to the organ in which the cells are. Indeed, some genes that could be influential in the case of fibroblasts, are also present in endothelial cells. Moreover, the higher level of similarity between organs can also be noticed with the similarities in the case of the genomic data for chromosome 1 (chr1) and the PT score, Nucleosome Free Regions (NFR) score, and TSS enrichment score, whose results were quite similar among cell types. When looking at unrealised potentials, it was also possible to notice the presence of some of those immune genes whose chromatin was not open in the case of steady state scenario. For this reason, also the first hypothesis was correct.
In order to achieve the goal of our project, we made use of various heterogeneous resources: